CrowdTruth for Recognizing Textual Entailment Annotation

This analysis uses the data gathered in the "Recognizing Textual Entailment" crowdsourcing experiment published in Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y. Ng: Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP 2008, pages 254–263.

Task Description: Given two sentences, the crowd has to choose whether the second (hypothesis) sentence can be inferred from the first (text) sentence (binary choice: true/false). Below is an example from the aforementioned publication:

Text: “Crude Oil Prices Slump”

Hypothesis: “Oil prices drop”

A screenshot of the task as it appeared to workers can be seen at the following repository.

The dataset for this task was downloaded from the following repository, which contains the raw output from the crowd on AMT. The processed input file can be found in the data folder. Besides the raw crowd annotations, the processed file also contains the text and the hypothesis to be tested against it, both of which were given to the crowd as input.


In [1]:
import pandas as pd

test_data = pd.read_csv("../data/rte.standardized.csv")
test_data.head()


Out[1]:
!amt_annotation_ids !amt_worker_ids orig_id response gold start end hypothesis task text
0 1 A2K5ICP43ML4PW 25 not_relevant 0 Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 Two films won six Oscars. IR The film was the evening's big winner, ba...
1 2 A15L6WGIK3VU7N 25 not_relevant 0 Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 Two films won six Oscars. IR The film was the evening's big winner, ba...
2 3 AHPSMRLKAEJV 25 not_relevant 0 Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 Two films won six Oscars. IR The film was the evening's big winner, ba...
3 4 A25QX7IUS1KI5E 25 not_relevant 0 Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 Two films won six Oscars. IR The film was the evening's big winner, ba...
4 5 A2RV3FIO3IAZS 25 not_relevant 0 Mon Mar 25 07:39:42 PDT 2019 Mon Mar 25 07:41:05 PDT 2019 Two films won six Oscars. IR The film was the evening's big winner, ba...

Declaring a pre-processing configuration

The pre-processing configuration defines how to interpret the raw crowdsourcing input. To do this, we need to define a configuration class. First, we import the default CrowdTruth configuration class:


In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig

Our test class inherits from the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Recognizing Textual Entailment task:

  • inputColumns: list of input columns from the .csv file with the input data
  • outputColumns: list of output columns from the .csv file with the answers from the workers
  • customPlatformColumns: a list of columns from the .csv file that define a standard annotation task, in the following order: judgment id, unit id, worker id, started time, submitted time. This variable is used for input files that do not come from AMT or FigureEight (formerly known as CrowdFlower).
  • annotation_separator: string that separates the crowd annotations in outputColumns
  • open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); in the task that we are processing, workers pick the answers from a pre-defined list, therefore the task is not open-ended and this variable is set to False
  • annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of possible responses (relevant, not_relevant)
  • processJudgments: method that defines processing of the raw crowd data; for this task, we lowercase the crowd answers so that they match the values in annotation_vector

The complete configuration class is declared below:


In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["gold", "task", "text", "hypothesis"]
    outputColumns = ["response"]
    customPlatformColumns = ["!amt_annotation_ids", "orig_id", "!amt_worker_ids", "start", "end"]
    
    # processing of a closed task
    open_ended_task = False
    annotation_vector = ["relevant", "not_relevant"]
    
    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
        return judgments

Pre-processing the input data

After declaring the configuration of our input file, we are ready to pre-process the crowd data:


In [4]:
data, config = crowdtruth.load(
    file = "../data/rte.standardized.csv",
    config = TestConfig()
)

data['judgments'].head()


Out[4]:
output.response output.response.count output.response.unique started unit submitted worker duration job
judgment
1 {u'not_relevant': 1, u'relevant': 0} 1 2 2019-03-25 07:39:42-07:00 25 2019-03-25 07:41:05-07:00 A2K5ICP43ML4PW 83 ../data/rte.standardized
2 {u'not_relevant': 1, u'relevant': 0} 1 2 2019-03-25 07:39:42-07:00 25 2019-03-25 07:41:05-07:00 A15L6WGIK3VU7N 83 ../data/rte.standardized
3 {u'not_relevant': 1, u'relevant': 0} 1 2 2019-03-25 07:39:42-07:00 25 2019-03-25 07:41:05-07:00 AHPSMRLKAEJV 83 ../data/rte.standardized
4 {u'not_relevant': 1, u'relevant': 0} 1 2 2019-03-25 07:39:42-07:00 25 2019-03-25 07:41:05-07:00 A25QX7IUS1KI5E 83 ../data/rte.standardized
5 {u'not_relevant': 1, u'relevant': 0} 1 2 2019-03-25 07:39:42-07:00 25 2019-03-25 07:41:05-07:00 A2RV3FIO3IAZS 83 ../data/rte.standardized

Computing the CrowdTruth metrics

The pre-processed data can then be used to calculate the CrowdTruth metrics:


In [5]:
results = crowdtruth.run(data, config)

results is a dict object that contains the quality metrics for the sentences, annotations and crowd workers.
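
A quick way to see which tables are available in results (a small sketch, not part of the original notebook):


In [ ]:
# list the keys of the results dictionary returned by crowdtruth.run
print(list(results.keys()))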

The sentence metrics are stored in results["units"]:


In [6]:
results["units"].head()


Out[6]:
duration input.gold input.hypothesis input.task input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
25 83 0 Two films won six Oscars. IR The film was the evening's big winner, ba... ../data/rte.standardized {u'not_relevant': 8, u'relevant': 2} 10 2 10 0.754990 {u'not_relevant': 0.874879061295, u'relevant':... 0.644444 {u'not_relevant': 0.8, u'relevant': 0.2}
35 83 1 Saudi Arabia is the world's biggest oil e... PP Saudi Arabia, the biggest oil producer in the ... ../data/rte.standardized {u'not_relevant': 6, u'relevant': 4} 10 2 10 0.529058 {u'not_relevant': 0.700282390791, u'relevant':... 0.466667 {u'not_relevant': 0.6, u'relevant': 0.4}
39 83 1 Bill Clinton received a reported $10 million a... PP Mr. Clinton received a hefty advance for the b... ../data/rte.standardized {u'not_relevant': 1, u'relevant': 9} 10 2 10 0.877036 {u'not_relevant': 0.0580681587159, u'relevant'... 0.800000 {u'not_relevant': 0.1, u'relevant': 0.9}
48 83 1 Clinton is articulate. PP Clinton is a very charismatic person. ../data/rte.standardized {u'not_relevant': 6, u'relevant': 4} 10 2 10 0.526438 {u'not_relevant': 0.697360027313, u'relevant':... 0.466667 {u'not_relevant': 0.6, u'relevant': 0.4}
49 83 1 Argentina sees upsurge in kidnappings. IR Kidnappings in Argentina have increased more t... ../data/rte.standardized {u'not_relevant': 0, u'relevant': 10} 10 1 10 1.000000 {u'not_relevant': 0.0, u'relevant': 1.0} 1.000000 {u'not_relevant': 0.0, u'relevant': 1.0}

The uqs column in results["units"] contains the sentence quality scores, capturing the overall worker agreement on each sentence. Here we plot its histogram:


In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = 15, 5

plt.subplot(1, 2, 1)
plt.hist(results["units"]["uqs"])
plt.ylim(0,270)
plt.xlabel("Sentence Quality Score")
plt.ylabel("#Sentences")

plt.subplot(1, 2, 2)
plt.hist(results["units"]["uqs_initial"])
plt.ylim(0,270)
plt.xlabel("Sentence Quality Score Initial")
plt.ylabel("# Units")


Out[7]:
Text(0,0.5,'# Units')

Plot the change in unit quality score between the beginning and the end of the process:


In [8]:
import numpy as np

sortUQS = results["units"].sort_values(['uqs'], ascending=[1])
sortUQS = sortUQS.reset_index()

plt.rcParams['figure.figsize'] = 15, 5

plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs_initial"], 'ro', lw = 1, label = "Initial UQS")
plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs"], 'go', lw = 1, label = "Final UQS")

plt.ylabel('Sentence Quality Score')
plt.xlabel('Sentence Index')


Out[8]:
Text(0.5,0,'Sentence Index')

The unit_annotation_score column in results["units"] contains the sentence-annotation scores, capturing the likelihood that an annotation is expressed in a sentence. For each sentence, we store a dictionary mapping each annotation to its sentence-annotation score.


In [9]:
results["units"]["unit_annotation_score"].head()


Out[9]:
unit
25    {u'not_relevant': 0.874879061295, u'relevant':...
35    {u'not_relevant': 0.700282390791, u'relevant':...
39    {u'not_relevant': 0.0580681587159, u'relevant'...
48    {u'not_relevant': 0.697360027313, u'relevant':...
49             {u'not_relevant': 0.0, u'relevant': 1.0}
Name: unit_annotation_score, dtype: object

Save unit metrics:


In [10]:
rows = []
header = ["orig_id", "gold", "hypothesis", "text", "uqs", "uqs_initial", "true", "false", "true_initial", "false_initial"]

units = results["units"].reset_index()
for i in range(len(units.index)):
    row = [units["unit"].iloc[i], units["input.gold"].iloc[i], units["input.hypothesis"].iloc[i], \
           units["input.text"].iloc[i], units["uqs"].iloc[i], units["uqs_initial"].iloc[i], \
           units["unit_annotation_score"].iloc[i]["relevant"], units["unit_annotation_score"].iloc[i]["not_relevant"], \
           units["unit_annotation_score_initial"].iloc[i]["relevant"], units["unit_annotation_score_initial"].iloc[i]["not_relevant"]]
    rows.append(row)
rows = pd.DataFrame(rows, columns=header)
rows.to_csv("../data/results/crowdtruth_units_rte.csv", index=False)

The worker metrics are stored in results["workers"]:


In [11]:
results["workers"].head()


Out[11]:
duration job judgment unit wqs wwa wsa wqs_initial wwa_initial wsa_initial
worker
A11GX90QFWDLMM 83 1 760 760 0.342472 0.544462 0.629009 0.334467 0.525292 0.636726
A14JQX7IFAICP0 83 1 180 180 0.414908 0.595823 0.696362 0.379697 0.555556 0.683455
A14Q86RX5HGCN 83 1 20 20 0.821105 0.852668 0.962983 0.733456 0.788889 0.929733
A14WWG6NKBDWGP 83 1 20 20 0.678581 0.754595 0.899265 0.571740 0.677778 0.843551
A151VN1BOY29J1 83 1 40 40 0.690083 0.780349 0.884326 0.582732 0.697222 0.835791

The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between one worker and all the other workers.


In [12]:
plt.rcParams['figure.figsize'] = 15, 5

plt.subplot(1, 2, 1)
plt.hist(results["workers"]["wqs"])
plt.ylim(0,55)
plt.xlabel("Worker Quality Score")
plt.ylabel("#Workers")

plt.subplot(1, 2, 2)
plt.hist(results["workers"]["wqs_initial"])
plt.ylim(0,55)
plt.xlabel("Worker Quality Score Initial")
plt.ylabel("#Workers")


Out[12]:
Text(0,0.5,'#Workers')

Save the worker metrics:


In [13]:
results["workers"].to_csv("../data/results/crowdtruth_workers_rte.csv", index=True)

The annotation metrics are stored in results["annotations"]. The aqs column contains the annotation quality scores, capturing the overall worker agreement over each possible annotation.


In [14]:
results["annotations"]


Out[14]:
output.response aqs aqs_initial
not_relevant 8000 0.706485 0.616694
relevant 8000 0.793250 0.715569
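
For completeness, the annotation metrics can be saved in the same way as the unit and worker metrics (the file name below is illustrative and not part of the original outputs):


In [ ]:
results["annotations"].to_csv("../data/results/crowdtruth_annotations_rte.csv", index=True)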

In [15]:
sortedUQS = results["units"].sort_values(["uqs"])

Example of a very clear unit


In [16]:
sortedUQS.tail(1)


Out[16]:
duration input.gold input.hypothesis input.task input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
1017 83 1 The Pamplona fiesta has been celebrated for ce... RC The centuries-old Pamplona fiesta in honor of ... ../data/rte.standardized {u'not_relevant': 0, u'relevant': 10} 10 1 10 1.0 {u'not_relevant': 0.0, u'relevant': 1.0} 1.0 {u'not_relevant': 0.0, u'relevant': 1.0}

In [17]:
print("Hypothesis: %s" % sortedUQS["input.hypothesis"].iloc[len(sortedUQS.index)-1])
print("Text: %s" % sortedUQS["input.text"].iloc[len(sortedUQS.index)-1])
print("Expert Answer: %s" % sortedUQS["input.gold"].iloc[len(sortedUQS.index)-1])
print("Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[len(sortedUQS.index)-1])
print("Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[len(sortedUQS.index)-1])


Hypothesis: The Pamplona fiesta has been celebrated for centuries.
Text: The centuries-old Pamplona fiesta in honor of St. Fermin draws thousands of fans from around the world for a week of frenzied late night drinking and early morning running with the bulls.
Expert Answer: 1
Crowd Answer with CrowdTruth: Counter({'relevant': 1.0, 'not_relevant': 0.0})
Crowd Answer without CrowdTruth: Counter({'relevant': 1.0, 'not_relevant': 0.0})

Example of an unclear unit


In [18]:
sortedUQS.head(1)


Out[18]:
duration input.gold input.hypothesis input.task input.text job output.response output.response.annotations output.response.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
1521 83 1 Apartheid in South Africa was abolished in 1990. QA On 2 February 1990, at the opening of Parliame... ../data/rte.standardized {u'not_relevant': 5, u'relevant': 5} 10 2 10 0.439065 {u'not_relevant': 0.492508591773, u'relevant':... 0.444444 {u'not_relevant': 0.5, u'relevant': 0.5}

In [19]:
print("Hypothesis: %s" % sortedUQS["input.hypothesis"].iloc[0])
print("Text: %s" % sortedUQS["input.text"].iloc[0])
print("Expert Answer: %s" % sortedUQS["input.gold"].iloc[0])
print("Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[0])
print("Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[0])


Hypothesis: Apartheid in South Africa was abolished in 1990.
Text: On 2 February 1990, at the opening of Parliament, he declared that apartheid had failed and that the bans on political parties, including the ANC, were to be lifted.
Expert Answer: 1
Crowd Answer with CrowdTruth: Counter({'relevant': 0.5074914082267219, 'not_relevant': 0.4925085917732781})
Crowd Answer without CrowdTruth: Counter({'not_relevant': 0.5, 'relevant': 0.5})

MACE for Recognizing Textual Entailment Annotation

We first pre-processed the crowd results to create a file compatible with the MACE tool. Each row in the csv file corresponds to a unit in the dataset and each column to a worker; each cell contains the worker's answer for that unit (or remains empty if the worker did not annotate it).
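
A minimal sketch of this pivoting step is shown below, assuming the column names from rte.standardized.csv (orig_id, !amt_worker_ids, response) and a relevant → 1 / not_relevant → 0 encoding inferred from the preview that follows; the file used in this notebook was produced beforehand and may have been generated differently.


In [ ]:
import pandas as pd

raw = pd.read_csv("../data/rte.standardized.csv")
# encode the crowd answers numerically (assumed mapping, matching the 0/1 values in the preview below)
raw["response"] = raw["response"].map({"relevant": 1, "not_relevant": 0})
# one row per unit, one column per worker; cells stay empty where a worker skipped a unit
mace_matrix = raw.pivot_table(index="orig_id", columns="!amt_worker_ids",
                              values="response", aggfunc="first")
mace_matrix.to_csv("../data/mace_rte.standardized.csv", header=False, index=False)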


In [20]:
import numpy as np

test_data = pd.read_csv("../data/mace_rte.standardized.csv", header=None)
test_data = test_data.replace(np.nan, '', regex=True)
test_data.head()


Out[20]:
0 1 2 3 4 5 6 7 8 9 ... 154 155 156 157 158 159 160 161 162 163
0 0 0 0 0 0 0 1 0 0 1 ...
1 1 1 ...
2 1 0 ...
3 1 1 ...
4 1 1 1 1 1 1 1 1 1 1 ...

5 rows × 164 columns


In [21]:
import pandas as pd

mace_data = pd.read_csv("../data/results/mace_units_rte.csv")
mace_data.head()


Out[21]:
unit true false gold
0 25 9.263818e-06 9.999907e-01 0
1 35 1.353801e-06 9.999986e-01 1
2 39 1.000000e+00 1.057981e-08 1
3 48 5.895417e-07 9.999994e-01 1
4 49 9.999986e-01 1.420726e-06 1

In [22]:
mace_workers = pd.read_csv("../data/results/mace_workers_rte.csv")
mace_workers.head()


Out[22]:
worker competence
0 A2K5ICP43ML4PW 0.804252
1 A15L6WGIK3VU7N 0.855198
2 AHPSMRLKAEJV 0.690429
3 A25QX7IUS1KI5E 0.473449
4 A2RV3FIO3IAZS 0.348140

CrowdTruth vs. MACE on Worker Quality


In [23]:
mace_workers = pd.read_csv("../data/results/mace_workers_rte.csv")
crowdtruth_workers = pd.read_csv("../data/results/crowdtruth_workers_rte.csv")

mace_workers = mace_workers.sort_values(["worker"])
crowdtruth_workers = crowdtruth_workers.sort_values(["worker"])

In [24]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

plt.scatter(
    mace_workers["competence"],
    crowdtruth_workers["wqs"],
)

plt.title("Worker Quality Score")
plt.xlabel("MACE")
plt.ylabel("CrowdTruth")


Out[24]:
Text(0,0.5,'CrowdTruth')
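
To complement the scatter plot with a single number, we can also compute a rank correlation between the two sets of worker scores (a sketch that assumes scipy is available; this comparison is not part of the original notebook):


In [ ]:
from scipy.stats import spearmanr

# rows are aligned because both data frames were sorted by worker id above
corr, pval = spearmanr(mace_workers["competence"], crowdtruth_workers["wqs"])
print("Spearman correlation: %.3f (p = %.3g)" % (corr, pval))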

In [25]:
sortWQS = crowdtruth_workers.sort_values(['wqs'], ascending=[1])
sortWQS = sortWQS.reset_index()
worker_ids = list(sortWQS["worker"])

mace_workers = mace_workers.set_index('worker')
# align the MACE scores with the CrowdTruth ordering of the workers
mace_workers = mace_workers.loc[worker_ids]

plt.rcParams['figure.figsize'] = 15, 5

plt.plot(np.arange(sortWQS.shape[0]), sortWQS["wqs"], 'bo', lw = 1, label = "CrowdTruth Worker Score")
plt.plot(np.arange(mace_workers.shape[0]), mace_workers["competence"], 'go', lw = 1, label = "MACE Worker Score")

plt.ylabel('Worker Quality Score')
plt.xlabel('Worker Index')
plt.legend()


Out[25]:
<matplotlib.legend.Legend at 0x111e68250>

CrowdTruth vs. MACE vs. Majority Vote on Annotation Performance
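
We compare the three approaches by computing the F1 score of the positive ("true", i.e. relevant) label against the expert gold annotations: F1 = 2·TP / (2·TP + FP + FN). For CrowdTruth and MACE, the decision threshold over the aggregated unit scores is swept from 0.01 to 1.0; the majority vote baseline uses a fixed threshold of 0.5 on the initial (pre-CrowdTruth) scores.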


In [26]:
import pandas as pd 
import numpy as np

majvote = pd.read_csv("../data/results/majorityvote_units_rte.csv")
mace = pd.read_csv("../data/results/mace_units_rte.csv")
crowdtruth = pd.read_csv("../data/results/crowdtruth_units_rte.csv")

In [61]:
def compute_F1_score(dataset):
    # sweep the decision threshold from 0.01 to 1.0 and compute, for each threshold,
    # the F1 score of the positive ("true") label against the expert gold standard
    f1_scores = np.zeros(shape=(100, 2))
    for idx in range(0, 100):
        thresh = (idx + 1) / 100.0
        tp = 0
        fp = 0
        tn = 0
        fn = 0

        for gt_idx in range(0, len(dataset.index)):
            if dataset['true'].iloc[gt_idx] >= thresh:
                if dataset['gold'].iloc[gt_idx] == 1:
                    tp = tp + 1.0
                else:
                    fp = fp + 1.0
            else:
                if dataset['gold'].iloc[gt_idx] == 1:
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0

        f1_scores[idx, 0] = thresh
        if tp != 0:
            f1_scores[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)
        else:
            f1_scores[idx, 1] = 0
    return f1_scores


def compute_majority_vote(dataset, crowd_column):
    # F1 score of the majority vote answers (initial scores) at a fixed 0.5 threshold;
    # crowd_column is kept for interface compatibility but is not used here
    tp = 0
    fp = 0
    tn = 0
    fn = 0

    for j in range(len(dataset.index)):
        if dataset['true_initial'].iloc[j] >= 0.5:
            if dataset['gold'].iloc[j] == 1:
                tp = tp + 1.0
            else:
                fp = fp + 1.0
        else:
            if dataset['gold'].iloc[j] == 1:
                fn = fn + 1.0
            else:
                tn = tn + 1.0
    return 2.0 * tp / (2.0 * tp + fp + fn)

In [54]:
F1_crowdtruth = compute_F1_score(crowdtruth)
print(F1_crowdtruth[F1_crowdtruth[:,1].argsort()][-10:])


[[0.57       0.92229299]
 [0.5        0.92269939]
 [0.56       0.92288243]
 [0.55       0.92346299]
 [0.54       0.92518703]
 [0.49       0.9253366 ]
 [0.53       0.92537313]
 [0.48       0.92570037]
 [0.51       0.92707046]
 [0.52       0.9280397 ]]

In [55]:
F1_mace = compute_F1_score(mace)
print(F1_mace[F1_mace[:,1].argsort()][-10:])


[[0.19       0.92793932]
 [0.13       0.92812106]
 [0.12       0.92830189]
 [0.09       0.92866083]
 [0.07       0.92901619]
 [0.14       0.92929293]
 [0.16       0.92929293]
 [0.15       0.92929293]
 [0.1        0.92982456]
 [0.11       0.93099122]]

In [62]:
F1_majority_vote = compute_majority_vote(majvote, 'value')
F1_majority_vote


Out[62]:
0.4257142857142857

In [ ]: